Softmax policy gradient methods can take exponential time to converge
Abstract
The softmax policy gradient (PG) method, which performs gradient ascent under the softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing the global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take
$$\frac{1}{\eta}\,|\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} \ \text{iterations}$$
to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
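To make the update rule concrete, here is a minimal sketch (not taken from the paper) of one exact softmax PG iteration on a tabular MDP. The function name `softmax_pg_step` and its arguments are illustrative assumptions; the gradient expression is the standard softmax policy-gradient formula, $\frac{\partial V^{\pi_\theta}(\rho)}{\partial \theta(s,a)} = \frac{1}{1-\gamma}\, d_\rho^{\pi_\theta}(s)\, \pi_\theta(a|s)\, A^{\pi_\theta}(s,a)$, evaluated with exact policy evaluation.

```python
import numpy as np

def softmax_pg_step(theta, P, r, rho, gamma, eta):
    """One exact softmax policy-gradient ascent step on a tabular MDP.

    theta : (S, A) logits of the softmax policy
    P     : (S, A, S) transition probabilities
    r     : (S, A) rewards
    rho   : (S,) initial state distribution
    gamma : discount factor in [0, 1)
    eta   : stepsize
    """
    S, A = theta.shape
    # Softmax policy pi(a|s) from the logits (stabilized by max subtraction).
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    # State-to-state transition matrix and expected reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)          # (S, S)
    r_pi = (pi * r).sum(axis=1)                    # (S,)

    # Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi.
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum("sat,t->sa", P, V)   # (S, A)
    Adv = Q - V[:, None]                           # advantage A^pi(s, a)

    # Discounted state visitation d_rho^pi (normalized to sum to 1).
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)

    # Exact gradient of V^pi(rho) w.r.t. the logits, then ascent step.
    grad = (1.0 / (1 - gamma)) * d[:, None] * pi * Adv
    return theta + eta * grad, V @ rho
```

Repeatedly applying such an update with a constant stepsize $\eta$ is the procedure whose iteration count, on the paper's carefully-constructed three-action MDP, can scale as $\frac{1}{\eta}|\mathcal{S}|^{2^{\Omega(1/(1-\gamma))}}$ per the abstract above.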
Similar resources
Gradient Descent Can Take Exponential Time to Escape Saddle Points
Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not...
Cold-Start Reinforcement Learning with Softmax Policy Gradient
Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity ...
Policy Gradient Methods
A policy gradient method is a reinforcement learning approach that directly optimizes a parametrized control policy by gradient descent. It belongs to the class of policy search techniques that maximize the expected return of a policy in a fixed policy class while traditional value function approximation approaches derive policies from a value function. Policy gradient approaches have various a...
Policy Gradient Methods for Off-policy Control
Off-policy learning refers to the problem of learning the value function of a way of behaving, or policy, while following a different policy. Gradient-based off-policy learning algorithms, such as GTD and TDC/GQ [13], converge even when using function approximation and incremental updates. However, they have been developed for the case of a fixed behavior policy. In control problems, one would ...
Policy-Gradient Methods for Planning
Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to thes...
Journal
Journal title: Mathematical Programming
Year: 2023
ISSN: 0025-5610, 1436-4646
DOI: https://doi.org/10.1007/s10107-022-01920-6